Bioinformatics Advances

Oxford University Press (OUP)

Preprints posted in the last 7 days, ranked by how well they match Bioinformatics Advances's content profile, based on 184 papers previously published here. The average preprint has a 0.16% match score for this journal, so anything above that is already an above-average fit.

1
Can Large Language Models Diagnose Primary Immunodeficiency from Patient-Described Symptoms?

Reteig, L. C.; Woloshin, S.; Maglione, P. J.; Farmer, J. R.; Ong, M.-S.

2026-05-27 allergy and immunology 10.64898/2026.05.26.26353818 medRxiv
Top 1%
3.7%

Patients with primary immunodeficiency (PID) often face prolonged diagnostic delays and may increasingly turn to large language models (LLMs) to interpret their symptoms during this period. We evaluated whether an LLM could recognize PID from symptom descriptions derived from interviews with 21 PID patients. In a prior study, we showed that GPT-4o identified PID in 96% of cases when prompted with physician-written patient histories (Rider et al., JACI, 2024). Here, when prompted with symptom descriptions in patients' own words, GPT-5 identified PID in only 7 cases (33%), although it more broadly suggested immune system issues in 18 cases (81%). The gap between these findings indicates that LLMs are sensitive to the language and framing of symptom descriptions, performing substantially worse when patients describe their own symptoms in everyday language than when clinicians summarize patient histories in structured medical terms. This study underscores the need to carefully evaluate how LLMs are used in patient-facing applications.

2
Algorithmic Versus Expert Rankings of Large Language Models in Peritoneal Dialysis Prescription Review: A Trap-Embedded Synthetic Benchmark

Wei, C.-H.; Lin, H.-J.; Lai, W.-W.; Lin, H. M.

2026-06-01 nephrology 10.64898/2026.05.28.26354383 medRxiv
Top 2%
2.4%

Background: Clinical LLM benchmarks rarely test whether algorithmic rankings agree with expert clinical judgment. We developed a trap-embedded peritoneal dialysis (PD) benchmark comparing multiple scoring constructs with blinded nephrologist ratings. Methods: We generated 125 synthetic PD cases containing 13 ISPD-aligned trap types. Five LLMs (Claude Sonnet 4.5, GPT-5.4, Gemini 3.1 Pro, DeepSeek-R1, Grok 4.1 Fast) evaluated each case three times at temperature 0 (1,875 calls). Primary outcome was must-identify TDR_must, analyzed with GEE and case-clustered bootstrap. Secondary analyses included a verbosity-sensitive alarm-burden proxy, WCS, relaxed-match scoring, WCS sensitivity analyses, and a 25-output blinded expert adequacy substudy. Must-identify kappa was 0.89 in Stage 1 and 0.92 in Stage 2. Results: Rankings were discordant. Recall ranked Claude (0.977) and GPT-5.4 (0.955) above the other models (0.86-0.90, p<0.0001). The alarm-burden proxy favored concise models (Grok 0.689; 21.6 vs 2.4 issues/case), while WCS produced a third ordering. In the expert substudy, inter-rater concordance was strong (rho 0.977), but WCS did not show a positive association with expert adequacy (rho -0.17, p=0.41). Conclusion: Clinical LLM rankings in PD prescription review depend strongly on scoring construct. Algorithmic metrics should be reported alongside blinded expert adequacy ratings and should not alone determine deployment.

3
Translational bioinformatics and machine learning framework for biomarker discovery, disease prediction, and patient profiling for precision medicine

Ahmed, Z.; Govindareddy, P.; DeGroat, W.; Narayanan, R.; Peker, E.; Zeeshan, S.

2026-05-27 genetic and genomic medicine 10.64898/2026.05.23.26353961 medRxiv
Top 3%
1.7%

Precision medicine aims to move healthcare from a "one-size-fits-all" approach to personalized and predictive care across diverse populations. It promotes integration of multi-omics and phenotypic data to understand disease mechanisms and discover novel biomarkers and risk factors, which could be used to predict and prevent critical diseases in individual patients. A precision medicine approach can accelerate our ability to classify patients at higher risk of developing critical diseases, improve diagnostic capabilities, develop deeper understanding of individual risk, investigate racial differences and demographic characteristics, and find relationships between genetic variants, expressions, and diseases. This study focuses on implementing an innovative, data-driven framework of translational bioinformatics and Machine Learning (ML) techniques to analyze multi-omics data, including RNA-seq and Whole-Genome Sequencing (WGS) data, generated using blood samples of randomly consented patients. First, we utilized bioinformatics pipelines to identify differentially expressed genes and their pathogenic and likely pathogenic variants for the downstream data analysis, annotation, and visualization. Then, we applied a nexus of ML models for multi-omics biomarker discovery, disease prediction, density-based clustering, single-patient profiling, and pathogenicity classification. WGS data analysis supported the exploration of genetic variation and diversity among patients to identify known and novel biomarkers, whereas RNA-seq data analysis improved our understanding of the functional and biological pathways underlying disease states. We classified and clustered pathogenic variants and expressions across various genes and discovered numerous leading disease risk factors.
Our results include gene-disease associations and common pathways captured across the broader population, demonstrating a level of sensitivity and accuracy that has broad clinical implications. We validated our results through clinical records and state-of-the-science literature. This study delves into the strengths of multi-omics data integration and the capabilities of ML application in genetically diverse and complex patient cohorts. Our approach has the potential to elucidate complex gene-disease interactions for genetically diverse populations, which can support earlier diagnoses for patients in many disease realms.

4
Beyond Identifier Matching: An Empirical Characterization of Failure Modes in Biomedical Knowledge Graph Integration

Hu, S.; Cheng, H.; Gillenwater, L.; Manpearl, K.; Mandava, A.; Wang, Y.; Pividori, M.; Stranger, B.; Krishnan, A.; Greene, C.; Gao, Y.

2026-05-28 health informatics 10.64898/2026.05.26.26354182 medRxiv
Top 3%
1.5%

Objective. Biomedical knowledge graphs (KGs) such as PrimeKG, Hetionet, UMLS, and PharmGKB are increasingly used as the substrate for downstream machine-learning, retrieval-augmented generation, drug-repurposing, and electronic health record (EHR) augmentation pipelines. The dominant assumption in published work is that integrating two or more such KGs is a tractable engineering step solved by identifier (ID) matching. This paper interrogates that assumption empirically. We quantify how much concept overlap survives realistic alignment, and we characterize the new failure modes introduced by the methods that practitioners reach for when ID matching is insufficient. Materials and Methods. We compared four widely used biomedical KGs (PrimeKG, Hetionet v1.0, the full UMLS Metathesaurus, and PharmGKB) across eleven node types using a tiered alignment pipeline: (1) direct ID matching for nodes sharing a primary vocabulary; (2) cross-ontology bridging using standard mappings (e.g., MONDO-DOID, HPO-UMLS, HPO-UMLS-MeSH for side effects, NCBI Gene-HGNC-UMLS, UBERON-FMA/SNOMEDCT_US/NCI/MeSH for anatomy); (3) ClinicalBERT cosine-similarity grouping at threshold >= 0.98 for over-segmented disease nodes, with a deterministic suffix-stripping canonicalizer; (4) exact name matching for ontology-poor types (anatomy, REACTOME pathways); and (5) embedding-based fuzzy matching with UMLS lookup (SapBERT and ClinicalBERT) for free-text microbiome concepts. We applied the pipeline to a 698-concept gut-microbiome benchmark spanning taxa, pathways, and disease labels, validated grouping decisions against the curated SSSOM mappings released by the MONDO project, and audited the ClinicalBERT consolidation against five clinical-genetics case studies drawn from the literature. Results. Per-type pairwise coverage was strikingly asymmetric. 
Genes/proteins and the three Gene Ontology categories aligned cleanly across PrimeKG and Hetionet (mutual coverage 94-99%), but disease overlap was sparse: only 0.7% of PrimeKG individual disease nodes mapped to Hetionet, rising to 2.0% after MONDO grouping (versus 78.7% and 18.4% from the Hetionet side). PrimeKG-to-UMLS coverage spanned 100% (effect/phenotype via HPO) down to 20.8% (REACTOME pathways), with drugs at 73.7% and anatomy at 58.8%. PrimeKG-to-PharmGKB drug coverage required up to two bridging hops (DrugBank -> UMLS -> RxNorm/ATC/MeSH). Bigger was not uniformly more complete: on a 698-concept microbiome drug benchmark, Hetionet missed 0 concepts while PrimeKG missed 16. ClinicalBERT-based grouping consolidated 22,205 raw MONDO disease nodes into 17,080 groups but introduced three reproducible failure modes documented in case studies: (i) peer over-merging: for example, all 22 osteogenesis imperfecta subtypes collapsed into a single node despite distinct severity classes; (ii) parent-child collapse: e.g. acute myeloid leukemia merged with myeloid leukemia, erasing the acute/chronic distinction that drives clinical management; and (iii) lexical false positives: neurofibromatosis and schwannomatosis grouped together despite cellular-pathology differences. Discussion. Identifier matching alone is a weak baseline for biomedical KG integration. Cross-ontology bridges and embedding-based consolidation expand coverage but do so at the cost of clinically meaningful resolution, and the resulting failures are systematic rather than random. Reporting only aggregate coverage statistics obscures these losses, which propagate silently into downstream tasks. Conclusion. We provide reusable per-type coverage tables, a taxonomy of three integration failure modes, and concrete recommendations for downstream studies that depend on a unified biomedical KG. 
We argue that future KG integration work should report per-type coverage and per-cluster confidence rather than aggregate match rates.
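
The asymmetric coverage figures reported above (for example, 0.7% from the PrimeKG side versus 78.7% from the Hetionet side) follow from the fact that coverage is a directional ratio. The sketch below is a minimal illustration only, assuming node sets have already been mapped to a shared identifier space; the function name `directional_coverage` and the toy identifiers are hypothetical and are not taken from the paper.

```python
def directional_coverage(source_ids, target_ids):
    """Fraction of source nodes that also appear in the target graph.

    Assumes both collections already use the same identifier vocabulary
    (an assumption of this sketch, not a step described in the abstract).
    """
    source = set(source_ids)
    target = set(target_ids)
    if not source:
        return 0.0
    return len(source & target) / len(source)


# Toy example: a large, finely segmented graph versus a small, coarse one.
large_kg = {"D1", "D2", "D3", "D4"}
small_kg = {"D1"}

large_to_small = directional_coverage(large_kg, small_kg)  # 1 of 4 nodes map
small_to_large = directional_coverage(small_kg, large_kg)  # 1 of 1 nodes map
```

Reporting both directions per node type, rather than a single aggregate match rate, is what makes this kind of asymmetry visible.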

5
A priority index-based computational medicine framework (PimRNA) for prioritising personalised mRNA cancer vaccines

Fang, H.; Tan, T.

2026-05-29 oncology 10.64898/2026.05.26.26354114 medRxiv
Top 3%
1.5%

Background: The development of personalised mRNA cancer vaccines holds considerable promise for oncology, yet a significant translational gap persists between neoantigen identification and the selection of therapeutically impactful targets. Current approaches predominantly prioritise human leukocyte antigen (HLA) binding affinity and immunogenicity, often overlooking the systems-level biological context of the target. This can inadvertently favour immunogenic but biologically peripheral peptides that exert limited influence on tumour signalling networks, thereby constraining vaccine efficacy. Furthermore, mRNA therapeutics must satisfy additional design requirements, including favourable codon usage and favourable secondary-structure stability, which directly affect in vivo translation and half-life. A unified computational framework that integrates neoantigen discovery with network biology is therefore critically needed. Results: Here, we present PimRNA, a Priority index (Pi)-centric computational medicine framework that bridges this gap by unifying neoantigen identification, mRNA sequence optimisation, and gene interaction network analysis. First, high-confidence tumour-specific HLA class I and II neoantigenic peptides are identified from paired tumour-normal genomic and tumour transcriptomic data using NeoDisc. Second, the coding sequences of these peptides are optimised for stability and translational efficiency with LinearDesign, yielding a core set of neoantigen-encoding mRNAs. Third, a random walk with restart algorithm is applied to a knowledgebase of gene interactions to identify peripheral genes exhibiting significant network connectivity to core genes, generating a gene-predictor matrix in which each gene is assigned an affinity score reflecting its network proximity to immunogenic neoantigens. 
These scores are consolidated into a single, unified priority rating (0-5) for each gene, followed by subnetwork analysis that reveals therapeutically relevant gene modules. Application of PimRNA to breast cancer and melanoma datasets demonstrates that it successfully selects high-confidence immunogenic neoantigen candidates embedded within biologically meaningful tumour-specific networks. Conclusion: PimRNA provides a systems biology foundation for mRNA vaccine design, moving beyond isolated immunogenicity to prioritise targets that are both highly presented and central to tumour-relevant biological networks. This framework offers a generalisable strategy for the rational discovery and prioritisation of mRNA therapeutics, significantly advancing the field of computational medicine towards personalised cancer vaccines.
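
The network step described above relies on random walk with restart. As a rough illustration of that general technique, and not of PimRNA's own implementation, the sketch below runs a textbook version on a tiny undirected graph. The graph, the node names, the restart probability of 0.5, and the function name `random_walk_with_restart` are all assumptions made for this example.

```python
def random_walk_with_restart(adj, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for each node.

    adj   : dict mapping node -> list of neighbour nodes (undirected graph)
    seeds : non-empty set of nodes the walker jumps back to on restart
    """
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart component: a fixed share of mass returns to the seeds.
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            nbrs = adj[u]
            if not nbrs:
                # Isolated node: send its walking mass back to the seeds.
                for n in nodes:
                    nxt[n] += (1 - restart) * p[u] * p0[n]
                continue
            share = (1 - restart) * p[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Toy path graph A - B - C - D with A as the only seed ("core") node.
toy_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = random_walk_with_restart(toy_graph, seeds={"A"})
```

In this toy graph the score decays with distance from the seed, which is the sense in which such scores can be read as network proximity to the core genes.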

6
Boundary-Specific Failure Modes and Safety Trade-offs of Large Language Models in Chronic Kidney Disease Renoprotective Therapy Review: A Stratified Synthetic Benchmark

Yeh, S.-E.; Lin, H.-J.; Lai, W.-W.; Lin, H.

2026-05-30 nephrology 10.64898/2026.05.28.26353938 medRxiv
Top 4%
1.1%

Background. Renoprotective therapies - SGLT2 inhibitors, finerenone, and renin-angiotensin system inhibitors (RASi) - remain underutilised in chronic kidney disease (CKD). Large language models (LLMs) may detect therapy omissions, but their performance across CKD severity strata and at clinical decision boundaries has not been evaluated. Methods. We constructed 100 synthetic CKD vignettes (G3a-G5D; 75 with prespecified omissions, 25 decoys) and queried four LLMs three times each at temperature 0 (1,200 calls). Omission criteria were adapted from KDIGO 2024, including an investigator-defined gray-zone RASi initiation criterion at eGFR<15. Two nephrologists independently classified a stratified 20-case subset. Results. For SGLT2 inhibitor and finerenone omissions, all models achieved near-ceiling sensitivity (97-100%). For RASi, performance diverged at the eGFR<15 boundary: Grok 4.1 Fast 85% versus GPT-5.4 55%, Gemini 10%, DeepSeek 10%. Gap-detection inter-rater agreement was perfect (kappa = 1.000). Clinically incorrect reasoning rates ranged from 0% (GPT-5.4) to 27% (DeepSeek R1); of 52 instances, 31 were factual pharmacology errors and 21 reflected conservative boundary-discordant reasoning. Reproducibility (Jaccard) ranged from 0.74 to 0.93. Conclusions. This boundary-aware synthetic benchmark showed that aggregate sensitivity can conceal clinically important operational-rule discordance. Rule-based SGLT2 inhibitor and finerenone omissions were detected with near-ceiling sensitivity, whereas an investigator-defined gray-zone RASi criterion at eGFR<15 exposed model-specific boundary behaviour. Evaluation of LLM-based CKD decision support should report boundary-specific performance, reproducibility, and clinically incorrect reasoning alongside aggregate metrics.

7
Compatibility of National Food Composition Databases with USDA FoodData Central: A Seven-Country LLM-Based Analysis

Nakagawa, S.; Yamamoto, A.

2026-06-01 nutrition 10.64898/2026.05.23.26353942 medRxiv
Top 4%
0.9%

To evaluate the international interoperability of food composition databases, we assessed the compatibility of seven national food composition tables with USDA FoodData Central (FDC) using the LLM-based matching method reported previously (Nakagawa and Yamamoto, 2026). Databases from four English-speaking countries (Canada, United Kingdom, Australia, and New Zealand), South Korea, and Japan were compared with 8,158 USDA FDC entries (SR Legacy and Foundation Foods, excluding Survey/FNDDS). Match rates varied by country (62.0-89.7%) and food category. After excluding six USDA categories unsuitable for cross-national comparison, 45.2% of the remaining 6,290 entries were not matched by any country. Canada showed the highest concordance, reflecting shared North American food supply. Japan and South Korea showed similar low coverage for vegetables and spices. These findings suggest that while USDA FDC represents a practical foundation for a globally comprehensive food composition database given its breadth, systematic incorporation of country-specific foods and classification schemes will be necessary to achieve true international interoperability.

8
Locally adaptive conformal prediction intervals for polygenic score-based phenotype prediction via residual normalization and data-driven stratification

Yun, Y.; Hao, X.; Zhang, Y. D.

2026-05-30 genetic and genomic medicine 10.64898/2026.05.28.26354326 medRxiv
Top 5%
0.7%

Quantifying uncertainty in polygenic score (PGS)-based phenotype prediction is crucial for the integration of genomic data into precision medicine. While the PGS provides a fundamental pivot for point estimation, clinical decision-making necessitates the construction of well-calibrated prediction intervals that reliably encompass the true phenotypic values. However, phenotypic residuals are frequently characterized by complex heteroscedasticity and stratified variance structures across diverse demographic contexts. Existing approaches often rely on global calibration mechanisms, which fail to account for such localized variance structures and lead to systematic miscalibration within specific subpopulations. To bridge this gap, we propose Clustering-based Split Conformal Prediction with Normalized Residuals (C-SCNR), a versatile framework based on Split Conformal Prediction. By adopting residual normalization and incorporating a repetitive `split-and-cluster` mechanism, C-SCNR dynamically identifies latent error strata and applies fine-grained adjustments to the resulting intervals. Our framework requires no distributional assumptions regarding the phenotype, is compatible with any PGS method, and flexibly accommodates biologically-informed grouping. Simulation studies demonstrate that our framework consistently outperforms existing methods across diverse error distributions. In real-data applications analyzing Body mass index (BMI), Low-density lipoprotein (LDL) cholesterol, and High-density lipoprotein (HDL) cholesterol in the UK Biobank, C-SCNR effectively resolves the coverage deficiencies of existing methods in specific subgroups and consistently yields superior localized calibration. Overall, C-SCNR represents a flexible and powerful framework for constructing high-resolution context-specific prediction intervals, thereby facilitating more reliable clinical interpretations of polygenic risk.
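
As a hedged illustration of the residual-normalization idea mentioned above, the sketch below implements plain split conformal prediction with normalized residuals. It does not reproduce the `split-and-cluster` mechanism of C-SCNR, and it assumes that a point prediction and a positive residual-scale estimate are already available for every sample. The function names and the toy numbers are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample quantile used by split conformal prediction.

    Returns the k-th smallest score with k = ceil((n + 1) * (1 - alpha)),
    or infinity when the calibration set is too small for the target level.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def normalized_interval(y_cal, pred_cal, scale_cal, pred_new, scale_new, alpha=0.1):
    """Prediction interval whose width adapts to a per-sample scale estimate."""
    # Normalized nonconformity scores on the held-out calibration set.
    scores = [abs(y - p) / s for y, p, s in zip(y_cal, pred_cal, scale_cal)]
    q = conformal_quantile(scores, alpha)
    return pred_new - q * scale_new, pred_new + q * scale_new


# Toy calibration set: absolute residuals 2, 4, ..., 18 with scale 2 everywhere,
# so the normalized scores are 1, 2, ..., 9.
pred_cal = [0.0] * 9
y_cal = [2.0 * i for i in range(1, 10)]
scale_cal = [2.0] * 9
lower, upper = normalized_interval(y_cal, pred_cal, scale_cal,
                                   pred_new=10.0, scale_new=2.0, alpha=0.5)
```

Because the quantile is multiplied by the new sample's own scale, samples predicted to be noisier receive wider intervals; a stratified method would additionally calibrate the quantile within each subgroup.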

9
Dried blood spot proteomics as a diagnostic framework for citrin deficiency

Totsune, E.; Nakajima, D.; Konno, R.; Mikami-Saito, Y.; Arai-Ichinoi, N.; Nishida, H.; Yagi, H.; Ishige, T.; Suzuki, H.; Shirota, M.; Takayama, J.; Takano-Asai, C.; Shimura, M.; Sasai, H.; Lee, T.; Kido, J.; Nakajima, Y.; Kobayashi, H.; Kikuchi, A.; Numakura, C.; Hamazaki, T.; Oishi, K.; Nakamura, K.; Kawashima, Y.; Ohara, O.; Wada, Y.

2026-05-28 genetic and genomic medicine 10.64898/2026.05.26.26354012 medRxiv
Top 5%
0.7%

Background: Citrin deficiency, caused by biallelic pathogenic variants in SLC25A13, must be identified early to prevent serious complications such as hyperammonemia and liver failure. However, clinical diagnosis is often delayed due to its nonspecific presentation and the limited sensitivity of amino acid-based newborn screening methods. Although genome-based evaluations are being investigated to address these issues, concerns about their cost, turnaround time, variant interpretation ability, and data handling highlight the need for a more practical yet reliable alternative. We investigated the feasibility of applying a proteomic approach to dried blood spots (DBS), which are routinely used in newborn screening. Methods: We performed untargeted liquid chromatography-tandem mass spectrometry to analyze the proteome of DBS using a previously developed "non-targeted analysis of non-specifically DBS-absorbed proteins" (NANDA) workflow. SLC25A13 protein abundance was quantified in individuals with biallelic loss-of-function mutations, compound loss-of-function/missense mutations, and heterozygous carriers; this was also evaluated in healthy and diseased controls representing relevant differential diagnoses. To leverage proteomic information, we derived a multivariate proteomic signature using feature selection and evaluated its performance with leave-one-out cross-validation. Biological relevance was assessed by enrichment analysis, and complementary transcriptomics was performed using RNA sequencing. Results: A total of 7,474 proteins, including SLC25A13, were consistently detected in DBS. SLC25A13 was undetectable in individuals with biallelic loss-of-function mutations. However, individuals with compound loss-of-function/missense genotypes showed reduced but measurable SLC25A13 levels, comparable to those observed in heterozygous carriers.
In contrast, a compact 15-protein signature accurately identified individuals with compound loss-of-function/missense genotypes (AUC, 0.99; sensitivity, 1.00; specificity, 0.95). The signature was enriched for Ca2+-response, and transcriptomics showed downregulation of genes related to multimodal ion channels in affected individuals compared to controls. Conclusions: DBS-based proteomic profiling may assist in the diagnosis of citrin deficiency through SLC25A13-quantification and a biologically plausible multivariate signature. More broadly, this strategy offers a promising new diagnostic layer for protein disorders, providing a proteomic readout in a clinically practical DBS format with potential utility for future diagnostic and screening applications.

10
Keeping human in the loop: A three-phase generative AI workflow for research integrity in data-intensive science. A methodological case study using elite Ethiopian distance-running data

Galko, P.; Yisamaw, A.; Haugen, T.; Seiler, S.

2026-05-29 sports medicine 10.64898/2026.05.29.26354013 medRxiv
Top 6%
0.5%

Background: Generative AI tools can support data-intensive research by writing code, drafting prose, searching analytical possibilities, and stress-testing claims. They can also produce false citations, drift between statistical specifications, and lose continuity across long investigations. This paper describes a practical workflow for using AI systems in empirical research while keeping discovery, verification, and accountability inspectable. Methods: We developed and applied a three-phase human-AI workflow to a case study of 14 elite Ethiopian distance runners. The dataset contained 22,605 GPS-segments collected across 97 consecutive days in late 2025, supplemented by venue and athlete metadata collected in the field. Phase 1 used an autonomous data-exploration tool to pre-filter the hypothesis space across five seeded research questions. Phase 2 used an AI system under direct human guidance to construct candidate findings into numerical claims, verification scripts, and draft text. Phase 3 used an independent AI system in an adversarial role to stress-test methods, statistics, prose, figures, and citations. The workflow was informed by Pearl's distinction between association, intervention, and counterfactual reasoning, with human judgement retained for research direction, interpretation, and final claims. Results: The workflow produced three empirical analyses and a documented correction process. The analyses estimated an altitude-to-sea-level pace correction of +0.10 min/km per 1,000 m at matched heart rate, showed why pooled altitude-surface regression was not identifiable within this venue system, documented method-dependence in heart-rate-based intensity classification, characterised within-venue route variation as a 64/36 path-fixed-to-trail-variable split with the Sululta label resolving into two functionally distinct sub-venues, and reframed the cohort's training through a 3x3x3 prescription lattice grounded in Ethiopian coaching practice. 
The adversarial phase identified several hallucinated citations, a terminology error between HC1 and cluster-robust standard errors, and several inconsistencies between prose, figures, and computed results. Verification scripts re-derived nearly all numerical claims from the cleaned lap-level data. Conclusions: The case study shows how researchers can organise AI-assisted empirical work so that candidate discovery, claim construction, independent stress-testing, and final accountability remain separated. The workflow did not remove the need for domain expertise or human judgement. Its value was in making the route from candidate finding to manuscript claim explicit, reproducible, and open to challenge. Trial registration: Not applicable.

11
Grounding Language Models in Behavioral Science to Scale Physical Activity Interventions for Hispanic/Latinx Populations

Mantena, S. D.; Johnson, A.; Schuetz, N.; Tolas, A.; Montalvo, S.; Delgado-SanMartin, J.; Ramirez Posada, M.; Du, L.; Zhang, S.; Huynh, A. D.; Oppezzo, M.; King, A. C.; Schmiedmayer, P.; Lawrie, A.; Rodriguez, F.; Ashley, E.; Kim, D. S.

2026-05-28 cardiovascular medicine 10.64898/2026.05.26.26354165 medRxiv
Top 7%
0.3%

Objective: Hispanic/Latinx populations in the U.S. experience higher rates of chronic disease linked to physical inactivity, yet digital health interventions remain largely inaccessible to more than 16 million Hispanic/Latinx adults with limited English proficiency. While large language models (LLMs) offer scalable personalization, their use in non-English behavioral coaching is unexplored. This study introduces MHC-Coach-ES, a Spanish-language LLM fine-tuned on the Transtheoretical Model (TTM) of behavior change. Materials and Methods: We fine-tuned Llama 3-70B-Instruct using a two-stage pipeline. First, the model was adapted to Spanish health and motivational language using a 2.21-million-token corpus. Second, it was instruction-tuned on 3,268 translated human written messages to align the model with the Transtheoretical Model (TTM) of Behavioral Change. We compared MHC-Coach-ES with Llama 3-70B-Instruct and translated human-expert messages using a forced-choice preference survey (N = 77) and blinded expert review (N = 2). Results: Spanish-speaking participants significantly preferred MHC-Coach-ES messages over translated human-expert messages (81% preference, P<0.001). Linguistic analysis showed that MHC-Coach-ES produced more temporally anchored messages than the base model (65% vs. 20%), while maintaining readability. In blinded evaluation, clinical experts rated MHC-Coach-ES higher for alignment with Transtheoretical Model stages than human-expert messages (4.83 vs. 4.38 out of 5). The base model also outperformed translated expert messages across preference and expert ratings. Conclusions: Generative AI can operationalize behavioral science frameworks in Spanish, offering a scalable approach to reducing health disparities. The strong performance of both MHC-Coach-ES and the base model highlights the promise of generative and personalized approaches over translation-based localization for theory-driven behavioral interventions.

12
Polyphenol Estimator: A New Tool to Estimate Dietary Polyphenol Intake from ASA24 and NHANES Dietary Data

Wilson, S. M. G.; Oliver, A.; Lemay, D. G.

2026-05-29 nutrition 10.64898/2026.05.27.26353727 medRxiv
Top 8%
0.3%

Background: Recent food-based recommendations for flavan-3-ols highlight a growing need to understand the breadth of our dietary polyphenol exposure. However, estimation of dietary polyphenol intake remains challenging, requiring custom computational tools that are often difficult to implement or not fully reproducible. Objective: We aimed to develop an automated, user-friendly tool to estimate polyphenol intake from diet recalls and records. Methods: We developed Polyphenol Estimator, a tool that processes dietary data from the Automated Self-Administered 24-Hour (ASA24) Dietary Assessment Tool or the Automated Multiple-Pass Method from the National Health and Nutrition Examination Survey (NHANES). Polyphenol Estimator disaggregates foods into ingredients using the FDA Food Disaggregation Database, matches these ingredients to FooDB, and estimates polyphenol intake at the total, class, and compound level. Optionally, these polyphenol estimates can be used to calculate the Dietary Inflammatory Index (DII). Polyphenol Estimator is freely available online (https://swi1.github.io/polyphenol_estimator) with a tutorial for users with limited programming experience. Results: To illustrate Polyphenol Estimator, we applied it to two days of diet recalls from adults (≥ 20 years) in NHANES 2021-2023 (n = 2778). For 97.7% of participants, less than 2.5% of reported foods went unmapped, with 75.7% of participants having complete mappings. Total polyphenol intake was 517 +/- 439 (mean +/- SD) mg/1000 kcal, largely from green tea, coffee, black tea, apples, wine, oranges, and blueberries. At the class level, polyphenols classified as organooxygen compounds, flavonoids, and cinnamic acids and derivatives were top intake contributors. At the compound level, cryptochlorogenic acid, neochlorogenic acid, and caffeic acid were top contributors. Lastly, the DII was 1.4 +/- 1.9, indicating the average diet had proinflammatory potential.
Conclusions: Polyphenol Estimator offers an automated method to obtain total, class, and compound-level polyphenol estimates from dietary data to aid future efforts to understand polyphenol intake exposures and their biological impact on health.

13
Functionally informed annotation influences pathway-specific polygenic risk and disease inference in Alzheimer's disease

Bazemore, K.; Iqbal, T.; Kuzma, A. B.; Grant, S. F. A.; Schellenberg, G. D.; Wang, L.-S.; Chesi, A.; Jin, J.; Naj, A. C.

2026-05-26 epidemiology 10.64898/2026.05.25.26353905 medRxiv
Top 8%
0.3%

Pathway-specific polygenic risk scores (pathway-PRS) measure aggregate genetic risk across single nucleotide variants (SNVs) annotated to genes in a pathway of interest. In most applications, SNV-to-gene annotation is based on SNV position with respect to gene boundaries. This approach is ill-suited for incorporating non-coding SNVs, which can regulate gene expression over long distances and represent a large proportion of risk variants for Alzheimer's disease (AD). Here, we compare the performance of AD pathway-PRS across SNV-to-gene annotation strategies that integrate varying levels of functional genomic data, including adult brain chromatin interaction and expression quantitative trait loci (eQTL) data. In the UK Biobank (n=328,526), including AD cases defined by ICD-9/10 codes (n=3,043) and by family history of AD/dementia (n=38,589), we show that the annotation strategy integrating chromatin interaction and eQTL data consistently improves pathway-PRS performance. We replicate this finding in independent data from the Alzheimer's Disease Genetics Consortium (n=3,370). We further find that pathway-PRS associations with AD vary by annotation strategy and that power to detect sex-dependent and age-at-onset associations is increased with integrative annotation. Together, these findings support the use of functionally informed SNV-to-gene annotation for pathway-PRS construction and highlight the importance of applying multiple annotation strategies for robust inference.

14
Domain-based basal and ambulatory glycemic exposure metrics derived from continuous glucose monitoring: a real-world clinic-based study

Shinde, S. N.; Shinde, R. S.; Bhangaaley, S. Y.

2026-05-26 endocrinology 10.64898/2026.05.24.26353983 medRxiv
Top 9%
0.2%

Background: Consensus continuous glucose monitoring (CGM) metrics, including time in range (TIR), time above range (TAR), time below range (TBR), mean glucose, glucose management indicator, and glycemic variability, are essential for modern glucose assessment. However, these whole-day summaries do not explicitly partition nocturnal basal from daytime ambulatory glycemic burden. Objective: To develop and evaluate a complementary domain-based CGM framework that quantifies basal and daytime ambulatory glycemic exposure across oral glucose tolerance test (OGTT)-derived dysglycemia phenotypes. Methods: In this observational, clinic-based study, 253 individuals underwent OGTT with insulin measurement and CGM. Participants were classified using a prespecified OGTT-derived phenotyping algorithm, implemented through a deterministic rules-based web calculator, and collapsed into five groups: NoDM, Increased insulin resistance, Midzone Glycemia, Prediabetes, and Diabetes. CGM files were uniformly reprocessed by selecting the latest contiguous episode and retaining the most recent 15 calendar days with data. The 24-hour profile was partitioned into nocturnal basal (00:00 to <06:00) and daytime ambulatory (06:00 to <24:00) domains. Derived indices included Area of Basal Glycemia (ABG), Area of Prandial/Daytime Ambulatory Glycemia (APG), incremental ABG (iABG), incremental APG (iAPG), and exploratory deficit indices dABG and dAPG. Results: The final dataset contributed 3,647 analyzable CGM days. APG remained higher than ABG across all groups. Mean ABG/APG increased from 80.45/86.38 mg/dL in NoDM to 111.96/124.70 mg/dL in Diabetes. Mean iABG/iAPG increased from 5.65/6.60 to 34.12/38.91 mg/dL, whereas dABG/dAPG declined as dysglycemia worsened. Conclusions: The ABG/APG framework provides interpretable, domain-resolved CGM burden metrics that separate basal from daytime ambulatory exposure and distinguish total burden from above-threshold excess. 
These indices are proposed as adjunctive metrics to support dysglycemia phenotyping, early risk recognition, and treatment monitoring, but are not intended to replace established consensus CGM metrics or diagnostic criteria. External, prospective validation is required.
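The domain partition described above can be sketched as follows. The time windows (00:00 to <06:00 and 06:00 to <24:00) come from the abstract; everything else is an assumption for illustration. In particular, the abstract does not give the formulas for ABG, APG, or the incremental indices, so this sketch uses a plain per-domain mean and a mean excess above a caller-supplied `threshold` as stand-ins, and the readings are made-up values.

```python
from datetime import datetime
from statistics import mean
from typing import Dict, List, Tuple

Reading = Tuple[datetime, float]  # (timestamp, glucose in mg/dL)


def domain_of(ts: datetime) -> str:
    """Nocturnal basal is 00:00 to <06:00; the rest is daytime ambulatory."""
    return "basal" if ts.hour < 6 else "ambulatory"


def domain_summary(readings: List[Reading],
                   threshold: float) -> Dict[str, Dict[str, float]]:
    """Per-domain mean glucose and mean excess above `threshold`.

    `threshold` is a hypothetical parameter; the abstract does not state
    which cut-off the incremental indices use.
    """
    buckets: Dict[str, List[float]] = {"basal": [], "ambulatory": []}
    for ts, glucose in readings:
        buckets[domain_of(ts)].append(glucose)

    summary: Dict[str, Dict[str, float]] = {}
    for name, values in buckets.items():
        if values:  # skip a domain that has no readings
            summary[name] = {
                "mean": mean(values),
                "mean_excess": mean([max(0.0, v - threshold) for v in values]),
            }
    return summary


# Made-up readings for a single day.
readings = [
    (datetime(2026, 1, 1, 1, 0), 80.0),
    (datetime(2026, 1, 1, 3, 0), 90.0),
    (datetime(2026, 1, 1, 8, 0), 100.0),
    (datetime(2026, 1, 1, 13, 0), 140.0),
    (datetime(2026, 1, 1, 20, 0), 120.0),
]
summary = domain_summary(readings, threshold=100.0)
```

Keeping the mean and the above-threshold excess as separate fields mirrors the abstract's distinction between total burden and above-threshold excess, without claiming to reproduce the authors' index definitions.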

15
In vitro splice-switching oligonucleotide rescues aberrant GFM2 pseudoexon inclusion and restores mitochondrial activity

Gross, S.; Birnbaum, R.; Shaul Lotan, N.; Mor-Shaked, H.; Manor, J.; Shaag, A.; Rosenbluh, C.; Levy-Memo, A.; Yanovsky-Dagan, S.; Saada, A.; Harel, T.

2026-06-01 genetic and genomic medicine 10.64898/2026.05.28.26354078 medRxiv
Top 9%
0.2%

Background: Biallelic variants in GFM2, encoding mitochondrial elongation factor G2 (mtEFG2), a GTPase involved in the termination stage of mitochondrial translation, cause autosomal recessive combined oxidative phosphorylation deficiency. Noncoding structural variants may be missed by exome sequencing but can disrupt splicing and provide opportunities for variant-specific therapeutic rescue. We investigated the molecular mechanism underlying suspected Leigh syndrome in an infant with mitochondrial disease and evaluated whether splice-switching oligonucleotide (SSO) treatment could correct the pathogenic splicing defect. Methods: The proband underwent exome sequencing followed by short-read and long-read whole genome sequencing. RNA sequencing, reverse-transcription PCR, quantitative PCR, and cycloheximide treatment were used to characterize the effect of the identified intronic duplication on GFM2 splicing and transcript stability. Patient-derived fibroblasts were treated with SSOs targeting the aberrant splice junction. Rescue was assessed by RNA studies, western blotting, and spectrophotometric measurement of cytochrome c oxidase (COX). Results: Whole genome sequencing identified a paternally inherited GFM2 missense variant, NM_032380.5:c.2195C>T p.(Pro732Leu), in trans to a maternally inherited 221-nucleotide intronic duplication, NM_032380.5:c.2029-741_2029-521dup. RNA studies revealed an 87-nucleotide pseudoexon, generated by activation of a cryptic acceptor splice site within the duplicated sequence. The resulting transcript harbored a premature termination codon (PTC) and underwent nonsense-mediated decay, as confirmed by cycloheximide rescue. Together with reduced mtEFG2 protein levels on western blot, the findings supported a loss-of-function mechanism. 
Enzymatic analysis of affected fibroblasts showed reduced activity of the mtDNA-dependent complex IV subunit COX, with preservation of the nuclear-encoded complex II enzyme succinate dehydrogenase and the control enzyme citrate synthase, consistent with impaired mitochondrial translation. An SSO targeting the aberrant intron-pseudoexon junction nearly abolished pseudoexon inclusion, restored correctly spliced GFM2 transcript from the duplication-containing allele, increased mtEFG2 protein levels, and significantly improved COX activity. Conclusions: This study identifies a pathogenic intronic GFM2 duplication that causes mitochondrial disease through pseudoexon activation and nonsense-mediated decay. The findings demonstrate the value of integrated genome and transcriptome analysis for exome-negative mitochondrial disease and provide in vitro proof of concept that SSOs can restore transcript processing, protein expression, and mitochondrial respiratory-chain function in patient-derived cells.

16
Measuring the Meaning of Genomic Results: Harmonization of the Metric for Case-Level Results in the CSER2 Consortium

Powell, B. C.; Amendola, L. M.; Bonini, K. E.; Crosslin, D.; Desrosiers-Battu, L.; Hiatt, S. M.; Hindorff, L.; Kenny, E. E.; Mavura, Y.; Muenzen Ferar, K. D.; Risch, N.; Roman, T.; Slavotinek, A.; Van Ziffle, J.; Bowling, K. M.

2026-06-01 genetic and genomic medicine 10.64898/2026.05.28.26354388 medRxiv
Top 9%
0.2%

Yield of reported results from genetic testing provides a proximal measure of clinical usefulness. While ACMG/AMP guidelines provide representations of uncertainty for individual genetic variant classification, additional factors are considered when determining whether results explain a patient's presentation. To standardize cross-consortium analysis, a working group of the Clinical Sequencing Evidence-Generating Research (CSER2) consortium iteratively identified factors used when contextualizing variant-level results to case-level interpretation (i.e., interpretation of an individual's genetic data with respect to the indication for testing). Sites independently categorized results; complex cases were discussed collaboratively, leading to revision of classification categories. Our metric incorporates factors beyond classification of reported variants. Analogous to variant-level results, "Definitive Positive" and "Probable Positive" represent certainty that results may be clinically explanatory. The category "Inconclusive" applies when results may or may not fully explain the patient presentation, with subdivision into multiple (non-exclusive) subcategories. Cases falling outside all of the other categories are considered "Negative". The overall diagnostic yield by this metric and use of categories for inconclusive results varied by CSER project, in part paralleling study design differences. This case-level categorization provides a meaningful assessment of diagnostic yield, and for inconclusive cases identifies potentially resolvable factors for case resolution.

17
Auditable cross-instrument detection of unusual multivariate psychiatric response configurations using a semantically aligned covariance subspace

Periwal, V.

2026-05-27 psychiatry and clinical psychology 10.64898/2026.05.22.26353902 medRxiv
Top 10%
0.1%

Background: Conventional psychiatric screening instruments summarize symptoms within individual scales and prioritize cases with high single-instrument additive score severity. This design treats items as independent within instruments and ignores cross-instrument covariance structure, making it insensitive to respondents whose responses are distributed across multiple domains in unusual combinations that remain below threshold on every individual scale. Methods: We analyzed two cohorts spanning older and younger adults. Item prompts from depression, stress, anxiety, and sleep instruments were embedded into a shared semantic space using a pretrained sentence encoder. Principal component analysis of the item-prompt embeddings alone, with no use of respondent data at this stage, was used to construct a low-dimensional subspace retaining 80% of variance in the item embedding matrix. Normalized participant responses were then projected into this subspace, with Jaccard-based stability analysis used as a check on dimensional robustness. Multivariate deviation from the cohort norm was quantified with Mahalanobis distance using Ledoit-Wolf covariance regularization. Candidate outliers were defined by the empirical 95th percentile of the cohort-specific distance distribution. To isolate response configurations not already captured by conventional single-instrument extreme-value logic, we excluded all outlier respondents who had endorsed any individual item at the maximum value of its Likert scale on any instrument. For the remaining outliers, anomalous components were backtracked to their original item loadings for interpretation. Results: In the older-adult Health and Retirement Study (HRS) cohort, principal component analysis of 27 item-prompt embeddings showed that a 10-dimensional subspace provided a stable representation of cross-instrument semantic structure. In the younger-adult Xinxiang cohort, the corresponding stable solution was 16-dimensional. 
In each cohort, seven respondents remained as multivariate outliers despite falling below every single-instrument extreme-value threshold. These cases were not characterized by uniformly severe symptom scores but by unusual cross-domain response configurations that became visible only in the shared semantic covariance subspace. The response structure of the retained configurations differed across cohorts: older-adult cases more often involved weak endorsement of mood-labeled items alongside nonzero body- and sleep-related responses, whereas younger-adult cases more often involved incomplete response configurations spanning mood, sleep, stress, and self-harm-related items. Conclusions: A semantically aligned, auditable covariance subspace provides a practical tool for flagging unusual multivariate response configurations that single-instrument additive screening may not flag. The method is interpretable at the level of original item contributions. It should be understood as a hypothesis-generating screen for unusual response configurations requiring further clinical assessment, not as a diagnostic instrument. Outcome validity remains to be established by prospective study.
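The distance-and-cutoff step of the method can be sketched in a few lines. This is a simplified two-dimensional illustration under stated assumptions, not the authors' pipeline: the points stand in for responses already projected into the subspace, a fixed shrinkage weight `alpha` toward a scaled identity stands in for the Ledoit-Wolf estimate (which chooses that weight from the data), and the percentile uses a nearest-rank rule. All names and values are hypothetical.

```python
from math import ceil, sqrt
from typing import List, Tuple

Point = Tuple[float, float]


def shrunk_cov_2d(points: List[Point], alpha: float):
    """Mean and shrunk covariance (a, b, d) of [[a, b], [b, d]].

    The sample covariance is pulled toward mu * I, where mu is the average
    variance. `alpha` is a fixed, assumed weight (not the Ledoit-Wolf value).
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    mu = (sxx + syy) / 2.0
    a = (1 - alpha) * sxx + alpha * mu
    d = (1 - alpha) * syy + alpha * mu
    b = (1 - alpha) * sxy
    return (mx, my), (a, b, d)


def mahalanobis_2d(p: Point, center: Point, cov) -> float:
    """Mahalanobis distance using the closed-form inverse of a 2x2 matrix."""
    a, b, d = cov
    det = a * d - b * b
    dx, dy = p[0] - center[0], p[1] - center[1]
    return sqrt((d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det)


def nearest_rank_percentile(values: List[float], q: float) -> float:
    """Empirical q-th percentile by the nearest-rank rule."""
    ordered = sorted(values)
    rank = max(1, ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def flag_outliers(points: List[Point], alpha: float = 0.1,
                  q: float = 95.0) -> List[int]:
    """Indices of points whose distance exceeds the cohort's q-th percentile."""
    center, cov = shrunk_cov_2d(points, alpha)
    dists = [mahalanobis_2d(p, center, cov) for p in points]
    cutoff = nearest_rank_percentile(dists, q)
    return [i for i, dist in enumerate(dists) if dist > cutoff]


# Toy cohort: 19 points near the diagonal plus one off-diagonal point that
# is not extreme on either axis taken alone.
points = [(float(i), float(i) + (0.5 if i % 2 else -0.5)) for i in range(19)]
points.append((5.0, 13.0))
flagged = flag_outliers(points)
```

The last point lies inside the range of both coordinates yet is flagged because it breaks the joint pattern, which is the kind of configuration a per-scale threshold would not pick up.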

18
Stratified evaluation of blood RNA sequencing in a rare disease cohort

Duzenli, T.; Durmus, S.; Kaya, H. E.; Sevilgen, F. E.; Kayhan, G.; Cakir, T.; Ergun, M. A.

2026-05-28 genetic and genomic medicine 10.64898/2026.05.27.26353804 medRxiv
Top 10%
0.1%

Background: RNA sequencing (RNA-seq) is increasingly recognized as a complementary tool to DNA-based sequencing for improving the diagnostic yield in Mendelian disorders. However, how the diagnostic performance of RNA-seq varies across molecularly and phenotypically distinct patient subgroups remains poorly defined. This study aimed to evaluate and compare the diagnostic utility of RNA-seq across three stratified groups of patients with non-diagnostic exome sequencing. Methods: We performed RNA-seq on whole blood samples from 90 patients with suspected Mendelian disease in whom clinical exome or whole-exome sequencing had failed to establish a molecular diagnosis. Patients were prospectively stratified into three groups of 30: (i) patients with a candidate variant of uncertain significance (VUS) with predicted splicing impact (Group 1), (ii) patients with a specific clinical pre-diagnosis but no identified pathogenic variant (Group 2), and (iii) patients without a specific pre-diagnosis or candidate variant (Group 3). Aberrant splicing, gene expression outliers, and allele-specific expression were analyzed using multiple bioinformatic tools and compared against a GTEx-derived control cohort. Results: RNA-seq contributed to a molecular diagnosis in 29 of 88 evaluable patients (32.9%). Diagnostic yield differed substantially across groups: 82.8% (24/29) in Group 1, 6.9% (2/29) in Group 2, and 10% (3/30) in Group 3. In Group 1, RNA-seq enabled reclassification of candidate VUS through direct demonstration of aberrant splicing events. In Group 2, RNA-seq identified a somatic mosaic ACTB variant missed by exome sequencing and reclassified a previously deprioritized APPL1 VUS. 
In Group 3, a deep intronic pseudoexon-activating variant in IGBP1 was identified in two siblings with severe microcephaly, providing evidence for a candidate X-linked microcephaly gene, and a pathogenic variant in RNU4-2, a non-protein-coding gene not captured by standard exome sequencing, was detected in a patient with ReNU syndrome. Conclusions: RNA-seq has the highest diagnostic utility when applied to evaluate candidate splice variants identified by prior DNA testing but also provides independent diagnostic value in patients without candidate variants. The systematic comparison across stratified patient groups supports the integration of RNA-seq into clinical genomic workflows and highlights the need for standardized analytic frameworks.

19
Is it time for a paradigm shift? Tailored online video education instead of pretest genetic counseling facilitates high genetic test uptake and informed choice for adults seeking cardiovascular genetic testing

Rivers, B.; Murray, B.; Applegate, C. D.; Tichnell, C.; Gordon, C.; McClellan, R.; Brown, E.; Nunez, K.; Barth, A. S.; Taylor, C. O.; Yanek, L. R.; Day, J.; James, C. A.

2026-06-01 genetic and genomic medicine 10.64898/2026.05.28.26354394 medRxiv
Top 10%
0.1%

Background: Pretest genetic counseling (GC) is recommended in conjunction with genetic testing (GT) for cardiovascular (CV) indications, yet access to CVGC is limited, leading to delayed GT. Posttest GC could increase GC and GT access but requires efficient pretest education that supports both informed GT decision-making and robust GT uptake. Methods: We developed four indication-tailored online CV genetics education videos and deployed them in a 3-arm randomized trial comparing pretest vs. posttest outpatient CVGC (RESEQUENCE-GC, NCT05422573). Participants were 1:1:1 randomized to pretest video education plus an optional (efficiency arm) or required (flipped arm) phone call with a genetic counselor and planned posttest CVGC or to standard pretest CVGC (SOC arm). Questionnaires administered at baseline and post-education included the CV Multidimensional Model of Informed Choice [MMIC] to quantify GT knowledge and informed GT choice. Results: 389/767 (50.7%) adults aged 18-80 (mean 51.2 ± 14.9 years) scheduling a first CVGC appointment consented to RESEQUENCE-GC and completed the baseline questionnaire. Efficiency arm participants (video education + optional phone call) were most likely to complete pretest education (134, 97.4% efficiency; 107, 85.6% flipped; 111, 87.4% SOC, p=0.0012) and elect GT (131, 95.6% efficiency; 105, 84.0% flipped; 107, 84.2% SOC, p=0.0036). Few (4, 2.9%) efficiency arm participants requested an optional pretest phone call. Most flipped arm participants (90, 84.1%) had no post-video questions, consistent with the 97-second [IQR: 65s-145s] median call duration. CV genetics knowledge was high post-education (median 8 [IQR 7,8]/8 MMIC items correct). Only video-based pretest education was associated with a significant increase in knowledge (p<0.0001). Nearly all participants made an informed GT choice with no difference between intervention (95.6%) and SOC (90.4%) arms (p=0.074). 
Conclusions: Tailored, online video pretest education can enhance CV GT uptake, support informed GT decision-making, and be integrated into efficient pretest workflows, suggesting utility in scalable posttest CVGC.

20
Breath volatile profiling reveals a diagnostic signature of MASLD in children

Berna, A. Z.; Panganiban, J.; Liu, Y.; Logan, J.; Russo, P.; Aryal, A.; Hafertepe, K.; Abu-Alreesh, S.; DeBosch, B.; Stoll, J.; John, A. R. O.

2026-05-27 gastroenterology 10.64898/2026.05.26.26353794 medRxiv
Top 10%
0.1%

Background & Aims: Metabolic dysfunction-associated steatotic liver disease (MASLD) is the leading cause of chronic liver disease in children. However, accurate, noninvasive diagnostic tools remain limited. Current screening methods are invasive or lack sensitivity. Breath-based volatile organic compound (VOC) analysis offers a simple approach with potential for point-of-care screening. This study aimed to identify and validate breath VOC signatures of pediatric MASLD. Approach & Results: We conducted a prospective IRB-approved cohort study at the Children's Hospital of Philadelphia (CHOP). Children aged between 7 and 20 years with MASLD (n=22), as defined by hepatic steatosis either by liver biopsy or imaging and 1 cardiometabolic risk factor, and a control group without MASLD (n=20) were enrolled. Breath samples were collected using a standardized protocol and analyzed by untargeted comprehensive two-dimensional gas chromatography-mass spectrometry (GC×GC-MS). Machine learning and unsupervised clustering were applied to identify discriminatory VOCs and assess heterogeneity. Untargeted GC×GC-MS analysis identified a distinct breath VOC signature in children with MASLD compared with non-MASLD controls. A Random Forest model achieved a sensitivity of 73% and specificity of 65%, with AUC of 0.84. The VOC 2,4-dimethyl-1-heptene demonstrated strong diagnostic performance in the discovery cohort with a sensitivity of 85%, specificity of 77% and an AUC of 0.81. Unsupervised clustering revealed four MASLD subgroups with distinct volatile phenotypes associated with differences in liver enzymes and metabolic parameters. External validation in a second pediatric cohort confirmed reproducible reductions in o/p-xylene in subjects with MASLD. Conclusions: Pediatric MASLD is associated with a reproducible breath VOC signature identified by untargeted GC×GC-MS. 
These findings support breath analysis as a scalable, noninvasive screening and stratification tool for pediatric MASLD and warrant validation in larger, longitudinal studies.